Speech synthesizer



United States Patent 3,532,821
Oct. 6, 1970

SPEECH SYNTHESIZER

Kazuo Nakata, Kokubunji-shi, and Akira Ichikawa, Musashino-shi, Japan, assignors to Hitachi, Ltd., Tokyo, Japan, a corporation of Japan

Filed Nov. 25, 1968, Ser. No. 778,560
Claims priority, application Japan, Nov. 29, 1967, 42/76,093
Int. Cl. G10l 1/00
U.S. Cl. 179-1
2 Claims, 7 Sheets of drawings

ABSTRACT OF THE DISCLOSURE

A speech synthesizer for compiling a speech from pre-recorded acoustic elements, wherein said acoustic elements are classified into two groups, one being a number of damped sinusoidal waves of different frequencies from which vowels and transient sounds of speech are produced according to control signals, the other being continuous signals having features of respective consonants, and the speech being synthesized by combining selected ones of said vowels and transient sounds with selected ones of said continuous signals corresponding to consonants.

This invention relates to a speech synthesizer, and particularly to a system for artificially reproducing speech by compiling pre-recorded acoustic elements (hereafter referred to as a pre-record and compilation system).

In a pre-record and compilation system, the pre-recorded unit is usually a word. Therefore, in order to increase the amount of speech that can be synthesized and to expand the application range from a limited particular field to a more general scope, a drastic increase in the number of units of speech (or words) to be pre-recorded is required. Such an increase in the number of pre-recorded words inevitably makes the system bulky and complicated and, further, increases the access time required for reading out wanted words.

One approach to solving the above problem may be to store monosyllables, instead of words, as the pre-recorded units or acoustic elements. With this method, however, it is known that the quality of the compiled speech is poor both in clearness and in naturalness. A reason for this inferior quality is that a word made by combining syllables differs greatly, in the characteristic features of the component syllables such as the frequencies of the formants, the intensity of the envelope, the pitch frequency, and the duration, from the same word pronounced naturally as part of an integral speech uttered with a particular meaning. The only way to solve this problem is to increase the number of pre-recorded units or acoustic elements, which contradicts the purpose for which syllables are adopted as the elements to be stored.

The main object of this invention is to provide an improved speech synthesizer of the pre-record and compilation system in which the above defects have been removed. Namely, the objects of this invention are to increase the variety of the synthesizable speech, to minimize the number of the acoustic elements to be stored as basic units or constituents of synthesized speech, to improve the quality of the synthesized speech, especially in the naturalness of the speech, and to reduce the size of the system.

According to this invention, voiced sounds each having a constant repetition rate and consonants, including nasal consonants, unvoiced consonants and voiced consonants, are employed as the acoustic elements. Each voiced sound is produced by selectively reading out and compiling together, at varied intervals determined by a control signal, a number of damped sinusoids of different frequencies which have been pre-recorded on a recording medium. The consonant part, on the other hand, is compiled from a number of naturally pronounced consonants, or from synthesized consonants which represent the characteristic features of the natural consonants. These constituent consonants are pre-recorded on a recording medium and read out under a control signal which controls both the timing of the read-out and its duration.

This invention will be described in detail with reference to the accompanying drawings, in which:

FIGS. 1a, 1b and 1c show a waveform of speech and the spectrum characteristics thereof;

FIGS. 2a, 2b, 2c and 2d show a waveform of a particular oscillation and the spectrum characteristics thereof;

FIGS. 3 and 4 are schematic diagrams illustrating the synthesis of waveforms by means of a magnetic drum;

FIG. 5 is a block diagram of an embodiment of a speech synthesizer according to this invention; and

FIGS. 6 and 7 are diagrams for explaining the operation of the essential portions of the above embodiment.

Fundamentally, a voice is produced when either a voiced sound, caused by the vibration of the vocal cords and consisting of almost periodically repeated intermittent triangular waves, or an unvoiced sound, caused by a turbulent flow produced by a constriction of the vocal tract and consisting of almost white random noise, is passed through the cavity formed in the vocal tract, that is, the articulatory organ extending from the glottis to the lips. In FIG. 1a, which shows a part of a waveform of a speech, the section indicated by reference numeral 1 represents a voiced sound in which the repetition rate of the vocal base is constant, and the section 2 a consonant. The frequency spectra of the above sounds 1 and 2 are characterized, as shown in FIGS. 1b and 1c respectively, by spectrum envelopes 3, which indicate the resonant characteristics of the articulator, and by the inner structure of the spectrum, which indicates the features of the vocal base. The former is characterized mainly by several single-resonance characteristics (that is, formants) 4, 4', 4'', 5, 5', and the latter mainly by either a harmonic line spectrum 6, which possesses periodicity, or the randomness of a continuous spectrum.

This invention makes it easy to synthesize a voiced sound of a constant repetition rate, for example one having a characteristic spectrum as shown in FIG. 1b, from a number of pre-recorded damped sinusoidal waves of different frequencies. The principle of this synthesis will be explained hereunder.

A damped sinusoidal oscillation as shown in FIG. 2a gives a single-resonant frequency spectrum as shown in FIG. 2b, said damped sinusoidal oscillation being represented by the formula $e^{-\alpha t} \sin \omega_i t$, where $\alpha$ is the damping factor, $t$ the time, and $\omega_i$ the angular frequency of the oscillation. If the damped sinusoidal oscillations are repeated at a constant period T as shown in FIG. 2c, the frequency spectrum thereof will become a harmonic line spectrum as shown in FIG. 2d. It is known in the acoustical theory of the production of voice that the spectrum envelope 3 as shown in FIG. 1b is produced by continuously cascading the single-resonant features as shown in FIG. 2b. Therefore, such a voiced sound as [...] the period [...]; the relative amplitude of the second formant is (ω/ω [...]; the relative amplitude of the third formant is [...], where $\omega_1$, $\omega_2$ and $\omega_3$ indicate respectively the angular frequencies of the first, the second and the third formants of the voice.
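As a numerical illustration of this principle, the following Python sketch (not part of the original disclosure) generates one damped sinusoid, repeats it at a constant period, and evaluates the spectrum at a few frequencies. The sampling rate, the damping factor of 150 per second, the 500 Hz oscillation and the 10 ms period are assumed values chosen only for illustration, and all function names are hypothetical.

```python
import cmath
import math

FS = 8000  # sampling rate in Hz (assumed for this sketch)


def damped_sinusoid(freq_hz, alpha, duration_s=0.020, fs=FS):
    """Samples of exp(-alpha * t) * sin(2 * pi * freq * t) over one drum revolution."""
    n = round(duration_s * fs)
    return [math.exp(-alpha * k / fs) * math.sin(2 * math.pi * freq_hz * k / fs)
            for k in range(n)]


def repeat(pulse, period_s, count, fs=FS):
    """Overlap-add `count` copies of the pulse, each started one period after the last."""
    step = round(period_s * fs)
    out = [0.0] * (step * (count - 1) + len(pulse))
    for i in range(count):
        for k, v in enumerate(pulse):
            out[i * step + k] += v
    return out


def magnitude_at(signal, freq_hz, fs=FS):
    """Magnitude of the discrete-time Fourier transform at a single frequency."""
    w = -2j * math.pi * freq_hz / fs
    return abs(sum(v * cmath.exp(w * k) for k, v in enumerate(signal)))


pulse = damped_sinusoid(500.0, alpha=150.0)         # one recorded element (cf. FIG. 2a)
train = repeat(pulse, period_s=0.010, count=10)     # repeated at T = 10 ms (cf. FIG. 2c)

for f in (300.0, 450.0, 500.0, 700.0):
    print(f, round(magnitude_at(pulse, f), 3), round(magnitude_at(train, f), 3))
```

With these assumed values, the single pulse shows one resonance near 500 Hz, while the repeated train keeps that envelope but concentrates its energy at multiples of 100 Hz, with almost none midway between two harmonics (for example at 450 Hz).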

Further, a transient sound from a voiced sound of a constant repetition rate, that is, a sound having a particular frequency spectrum, to another sound having another frequency spectrum can be synthesized with sufficient smoothness by the following steps: quantizing the variation in the frequencies of the characteristic formants between said two sounds; synthesizing sounds by adding the damped sinusoidal oscillations as described above; and then joining the sounds consecutively.
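The quantizing step can be sketched as follows. This is only an illustration: the linear glide between the two formant frequencies, the function name `quantize_transition` and the 50 Hz grid of recorded frequencies are assumptions, not details given in the text.

```python
def quantize_transition(start_hz, end_hz, channels_hz, steps):
    """Nearest recorded frequency at each point of an assumed linear formant glide."""
    path = []
    for i in range(steps):
        target = start_hz + (end_hz - start_hz) * i / (steps - 1)
        path.append(min(channels_hz, key=lambda c: abs(c - target)))
    return path


# Hypothetical grid of recorded damped-sinusoid frequencies (50 Hz apart).
recorded_hz = [200 + 50 * k for k in range(16)]

# One formant gliding from 300 Hz to 700 Hz in five consecutive segments.
print(quantize_transition(300, 700, recorded_hz, 5))
```

Each entry of the returned list names the recorded element to be read out during one segment; joining the segments consecutively gives the transient sound.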

Accordingly, in the speech synthesizer of this invention, the number of the acoustical elements to be pre-recorded need only be enough to cover, at an appropriate spacing, the frequency bands which are essential for constituting a speech (extending as far as the first, the second and the third formants). An example of such a number as realized in an embodiment of this invention is shown in the following Table 1.

TABLE 1. AN EXAMPLE OF THE NUMBER OF THE RECORDED ACOUSTICAL ELEMENTS OF DAMPED SINUSOIDAL OSCILLATION

[...]

As to the consonant portions of the voice (nasal consonant, unvoiced consonant and vocal or voiced consonant), it is only required to pre-record signals corresponding to the features of the respective consonants. The number of such signals is at most 16 as shown in the following Table 2.

TABLE 2. [...] ELEMENTS TO BE RECORDED

Fricative sounds: Consonants, Number of elements
Plosive sounds: Consonants, Number of elements
Nasal sounds: Consonants, Number of elements

[...]

Therefore, the total number of the acoustical elements to be recorded will be of the order of fifty. In order to improve the naturalness of the speech thus compiled, it is required to control the period of the above-described repetitive reproduction of the damped sinusoidal oscillations in accordance with the pitch period of the speech to be synthesized. A concrete method of such control will be described hereunder with reference to FIG. 3, which illustrates schematically a magnetic drum on which the above-mentioned damped sinusoidal oscillations are recorded.

Assuming that the lowest pitch frequency of the speech to be synthesized is 50 Hz., the damped sinusoidal oscillations are recorded for 20 ms., which corresponds to one revolution period of the drum. (This means that the time constant of damping is assumed to be about 20 ms. at most. This assumption is appropriate in view of the bandwidth of vowel formants.) If, for example, ten read-out heads are disposed at equal spacing along the circumference of the drum, the time difference between two adjacent heads will be 2 ms. This is the minimum controllable step of the pitch period, and the pitch frequency is controlled according to the selection of read-out in the following ten steps: 50 Hz., 55.5 Hz., 62.5 Hz., 71.5 Hz., 83.5 Hz., 100 Hz., 125 Hz., 166 Hz., 250 Hz. and 500 Hz. It will be understood that these steps can be made finer by increasing the number N of the heads.

Referring to FIG. 3, it is assumed that the I-th head is reading at a given instant. If the next reading is started when the beginning of the recorded signal comes to the position of the (I+k)-th head, the interval between two readings will become longer by kT seconds, whereas if it is started at the position of the (I-k)-th head, the interval will become shorter by kT seconds. (T indicates the time that the rotating drum takes to move from one head to the next head.) Assuming that the recorded signal is read out by a head continuously for one revolution of the magnetic drum, that is, for 20 ms., it will be seen from FIG. 4 that the beginning portion of a reading period overlaps a portion of the signal read by the preceding head and the ending portion overlaps a portion of the signal read by the ensuing head. The transition of physical features is thus achieved more smoothly, resulting in an improved quality of the synthesized speech.

In the following paragraphs, the pre-record and compilation type speech synthesizer of this invention will be described in detail in connection with an embodiment of the invention.

In FIG. 5, which is a block diagram of an embodiment of this invention, a multiple output system of n channels is shown. Constituents of the sentence to be converted to speech, which are selected in the main apparatus 10 of the information processing system (usually, a common large high-speed electronic computer), are immediately converted into speech output control signals 11, 12, ... 1n with reference to a magnetic drum 20 which contains a pronunciation dictionary (a stock of control signals for speech units to be articulated), and are then distributed to control signal decoders 101, 102, ... 10n for the respective channels, where the distributed control signals are decoded into a group of more concrete control signals 21, 22, ... 2n for reading the recorded acoustical elements. A part of the decoded signals is led to the recorded elements selecting gate matrixes 201, 202, ... 20n, while the remaining part is led to groups of controlling analogue multipliers (311, 312, 313), (321, 322, 323), ... (3n1, 3n2, 3n3) for controlling the relative amplitudes of the read-out signals. Thus, a specified acoustical element is read out through a specified head on the elements storage drum 30 at a specified time, and then the relative amplitude is controlled as required.

The amplitude-controlled outputs are led to summing amplifiers 314, 324, ... 3n4 in the respective channels and are added to each other, and are then controlled in regard to intensity in multipliers 315, 325, ... 3n5 as required for a phoneme and integral speech. After that, the outputs are combined with consonants in summing amplifiers 316, 326, ... 3n6 to become the resultant vocal outputs 31, 32, ... 3n. The above-described process is repeated, for example, every [...] ms., thereby producing a continuous speech output.
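The pitch steps quoted earlier for the ten-head drum of FIG. 3, and the lengthening or shortening of the reading interval by k head spacings, can be checked with a short calculation. The sketch below assumes the 20 ms. revolution and the ten equally spaced heads of that example; the function names are hypothetical.

```python
HEADS = 10             # read-out heads around the drum (example value from the text)
REVOLUTION_MS = 20.0   # one drum revolution, i.e. the length of each recorded element
STEP_MS = REVOLUTION_MS / HEADS  # time for the drum to move from one head to the next


def pitch_steps_hz(heads=HEADS, revolution_ms=REVOLUTION_MS):
    """Pitch frequencies obtained when readings start k head spacings apart."""
    step = revolution_ms / heads
    return [1000.0 / (k * step) for k in range(heads, 0, -1)]


def next_interval_ms(current_ms, k, step_ms=STEP_MS):
    """Reading interval after moving the next start by k heads (negative k shortens it)."""
    return current_ms + k * step_ms


print([round(f, 1) for f in pitch_steps_hz()])
print(next_interval_ms(10.0, 2), next_interval_ms(10.0, -2))
```

The computed frequencies agree with the ten steps listed in the text to within rounding, and doubling `HEADS` would halve the minimum step of the pitch period.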

Next, the essential components of the system will be described in more detail. As has been explained already, according to this invention a voice is separated into two parts, that is: (1) vowels and transient sounds (including semivowels and fluent sounds) and (2) consonants (unvoiced consonants, voiced consonants and nasal consonants). In synthesizing speech, the part (1) is produced by repeatedly reading out, at varied periods, the recorded damped sinusoidal waveforms, and the part (2) by directly reading out the required waveforms from the recorded consonantal ones; finally both parts are combined. It is already known that fricative sounds and plosive sounds can be produced by increasing the overlap of the consonant part (2) and the vowel and transient part, and plosive sounds also by making the vowel and transient part steep. Therefore, any syllable can be synthesized from the above-described parts (1) and (2).

Of the parts (1) and (2), only the part (1) needs to be read out repeatedly at varied periods, and the variable periods are common to all of the first, second and third formants.

Therefore, the read-out of the recorded acoustical elements will be explained hereunder for one particular channel. The acoustical elements recorded on the magnetic drum are classified into two categories, that is: a group of damped sinusoidal waves used for the synthesis of the above-described part (1), and a group of consonantal waves. The first group is divided into three ranges which overlap each other at their fringes, that is: the first formant range (16 channels from 200 to 950 Hz.), the second formant range (16 channels from 800 to 2,400 Hz.) and the third formant range (8 channels from 2,200 to 3,500 Hz.). In order to simplify the control structure, the channels on the magnetic drum are divided according to the above two categories, the first category being further divided into three zones, namely, the first, second and third zones. Thus, the recording channels of the drum are divided into four zones. That is, as shown in FIG. 6, the elements storage drum 400 is divided into four zones 401, 402, 403 and 404. The outputs of the reading heads for the respective channels in said four zones are led to the gate matrixes 411, 412, 413 and 414 for selecting outputs. Of said four gate matrixes, the matrixes 411, 412 and 413 for composing the formants are supplied in common with a head selecting signal 451, while the remaining matrix 414 is supplied with a signal 452 for selecting the consonant reading head.
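The three formant zones can be tabulated as in the sketch below. The text gives only the number of channels and the frequency limits of each range; the uniform spacing between channels and the helper names `channel_frequencies` and `nearest_channel` are assumptions made for illustration.

```python
# (number of channels, lowest frequency in Hz, highest frequency in Hz) from the text.
ZONES = {
    "first formant": (16, 200.0, 950.0),
    "second formant": (16, 800.0, 2400.0),
    "third formant": (8, 2200.0, 3500.0),
}


def channel_frequencies(count, low_hz, high_hz):
    """Channel frequencies of one zone, assuming equal spacing between the limits."""
    step = (high_hz - low_hz) / (count - 1)
    return [low_hz + i * step for i in range(count)]


def nearest_channel(zone, target_hz):
    """Index of the channel in the zone whose frequency is closest to target_hz."""
    freqs = channel_frequencies(*ZONES[zone])
    return min(range(len(freqs)), key=lambda i: abs(freqs[i] - target_hz))


for name, spec in ZONES.items():
    freqs = channel_frequencies(*spec)
    print(name, spec[0], "channels,", round(freqs[1] - freqs[0], 1), "Hz. apart")

print(sum(spec[0] for spec in ZONES.values()), "damped sinusoidal elements in all")
```

Under this assumption the forty damped sinusoidal elements, together with at most sixteen consonant elements, give the total "of the order of fifty" mentioned earlier.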

In order to determine which channel (frequency) should be selected in the respective zones, frequency selecting signals 461, 462 and 463 are given to the respective matrixes, since the first, second and third formants should be controlled independently. According to these control signals, damped sinusoidal waves of different frequencies (corresponding to the formant frequencies), repeatedly read out at particular periods (corresponding to the pitch periods), are obtained at output terminals 471, 472 and 473 of the gate matrixes 411, 412 and 413. The outputs from matrixes 412 and 413 are controlled as to their amplitudes relative to the output from the matrix 411 in analogue multipliers 422 and 423 with reference to control signals 465 and 466, and are then added to the latter output in a summing amplifier 431. The output from the summing amplifier 431 is further controlled as to its amplitude in an analogue multiplier 441 with reference to a control signal 481 so as to give a good effect of vocal sound and speech, and is then let out through an output terminal 490 as a continuous speech.

If a consonant is required, the consonant selected by the matrix 414 is added to the vowel and transient sound in a summing amplifier 440, after the amplitude of the consonant relative to the vowel and transient sound has been appropriately controlled in an analogue multiplier 424 with reference to the control signal 468.
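The signal path of FIG. 6 amounts to a weighted sum, which may be sketched for a single sample instant as follows. The function and parameter names are hypothetical, and the order of operations is inferred from the description above rather than taken from the drawing.

```python
def mix_sample(f1, f2, f3, consonant, a2, a3, intensity, consonant_gain):
    """One output sample from three formant read-outs and one consonant read-out.

    f1, f2, f3      -- outputs 471, 472, 473 of the formant gate matrixes
    consonant       -- output of the consonant gate matrix 414
    a2, a3          -- relative amplitudes set by control signals 465 and 466
    intensity       -- overall amplitude set by control signal 481
    consonant_gain  -- consonant amplitude set by control signal 468
    """
    voiced = f1 + a2 * f2 + a3 * f3   # multipliers 422, 423 and summing amplifier 431
    voiced = intensity * voiced       # multiplier 441
    return voiced + consonant_gain * consonant  # multiplier 424 and summing amplifier 440


print(mix_sample(1.0, 0.5, 0.25, 0.2, a2=0.5, a3=0.4, intensity=2.0, consonant_gain=0.5))
```

Setting `consonant_gain` to zero reproduces the vowel and transient path alone.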

FIG. 7 shows in more detail a part of one of the recorded elements selecting gate matrixes 411, 412, 413 and 414 shown in FIG. 6. As the gate matrixes 411, 412, 413 and 414 are substantially the same in operation, the following description will be made as to only one of them.

In FIG. 7, it is assumed that l recorded channels 1, 2, ... l on the magnetic drum are to be selectively read out by N reading heads 1, 2, ... N.

Signal 451 (for the matrixes 411, 412 and 413) or signal 452 (for the matrix 414), which specifies the heads by which the recorded signals are to be read out, is led to a decoding buffer 500 to be decoded therein. The decoding buffer 500 supplies output 1 to the output lines leading to the specified heads and output 0 to all of the remaining lines among the lines 501 to 50N.

Meanwhile, signal 461 (for 411), signal 462 (for 412) or signal 463 (for 413), which specifies the channels of which the outputs are to be taken, is led to another decoding buffer 600 to be decoded therein. The decoding buffer 600 supplies signal 1 to the selected lines and signal 0 to the remaining lines among the lines 601, 602, ... 60l. As to the analogue read-out from each channel on the magnetic drum, the outputs from the channels associated with the 1st heads are connected to terminals 11, 12, ... 1l respectively, the outputs from the channels associated with the 2nd heads are connected to terminals 21, 22, ... 2l, and the outputs from the channels associated with the N-th heads are connected to terminals N1, N2, ... Nl respectively.

Gate selecting signals 501, 502, ... 50N and 601, 602, ... 60l are first connected to selecting digital AND gates 111, 121, ... 1l1; 211, 221, ... 2l1; and N11, N21, ... Nl1 respectively, as shown in the figure. As a result, among the N x l gates only one gate, which receives the specified signal 1 on both inputs, opens to give an output 1 to the associated gate among the ensuing analogue gates 112, 122, ... 1l2; 212, 222, ... 2l2; N12, N22, ... Nl2. Thus, the output of the specified head read from the specified channel is selected.
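The selection logic of FIG. 7 can be modelled in a few lines. This is a simplified sketch with hypothetical names: heads and channels are numbered from zero, and each analogue read-out is represented by a single number.

```python
def one_hot(index, size):
    """Decoder output: 1 on the selected line, 0 on all the others."""
    return [1 if i == index else 0 for i in range(size)]


def select_output(read_outs, head, channel):
    """Pass exactly one of the N x l read-outs to the output amplifier.

    read_outs[h][c] is the analogue read-out of channel c at head h.
    """
    n_heads = len(read_outs)
    n_channels = len(read_outs[0])
    head_lines = one_hot(head, n_heads)            # decoding buffer 500, lines 501 to 50N
    channel_lines = one_hot(channel, n_channels)   # decoding buffer 600, lines 601 to 60l
    out = 0.0
    for h in range(n_heads):
        for c in range(n_channels):
            gate = head_lines[h] & channel_lines[c]   # digital AND gate
            out += gate * read_outs[h][c]             # analogue gate, then summation
    return out


# Three heads and four channels; the value 10 * h + c identifies each read-out.
signals = [[10 * h + c for c in range(4)] for h in range(3)]
print(select_output(signals, head=2, channel=1))
```

Only the AND gate whose two inputs are both 1 opens, so the sum contains a single term.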

Further, the decoded output from the decoding buffer 500 specifies not only the head to be selected but also the time at which the signal is read out from the head. (As the signal is always read out from the starting point of the record, the starting time can be easily determined from the timing pulse on the drum.) Therefore, assuming that the digital AND gates 111, 121, ... Nl1, once opened, maintain the output 1 during a complete revolution of the drum (the period being T ms., for example, 20 ms.), this selecting gate matrix allows a read-out as shown in FIG. 4.

The read-out outputs are summed and let out from the output amplifier 700. This output corresponds to one of the outputs 471, 472 or 473 in FIG. 6.

In the consonant selecting gate matrix, the read-out of a specified head from a specified channel is required to continue for the duration inherent to the particular consonant. This is achieved by controlling the duration with the decoded signal from the decoding buffer 500, whereas the duration is constant (for example, 20 ms.) in the case of vowels. This output corresponds to 474 in FIG. 6.
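The difference between the two kinds of gate matrix lies only in how long a gate stays open once it has been selected. A minimal sketch, with hypothetical names and example times, is given below: vowel gates are held for one revolution of 20 ms., while each consonant gate is held for its own duration.

```python
REVOLUTION_MS = 20.0  # hold time of a vowel gate (one drum revolution)


def active_readings(start_times_ms, now_ms, durations_ms=None):
    """Indices of the readings whose gate is still held open at time now_ms."""
    active = []
    for i, start in enumerate(start_times_ms):
        hold = REVOLUTION_MS if durations_ms is None else durations_ms[i]
        if start <= now_ms < start + hold:
            active.append(i)
    return active


# Vowel readings started every 8 ms. (four head spacings of 2 ms.).
vowel_starts = [0.0, 8.0, 16.0, 24.0]
print(active_readings(vowel_starts, 17.0))   # readings 0, 1 and 2 overlap
print(active_readings(vowel_starts, 21.0))   # reading 0 has ended

# A consonant held for an assumed 60 ms. is still being read at 45 ms.
print(active_readings([0.0], 45.0, durations_ms=[60.0]))
```

Because successive vowel readings overlap, the output amplifier sums several delayed copies of the recorded element at once, which is the read-out pattern of FIG. 4.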

It will be obvious that the above-described principle of this invention is equally applicable to a digital type recording of the acoustic elements or to a cyclic memory consisting of a group of shift registers. However, it will be understood that in a digital recording, a digital-to-analogue converter is required for converting the read-out output to an analogue waveform.
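A cyclic memory of this kind might be modelled as below. The sketch is an assumption-laden illustration rather than a circuit from the disclosure: the class `CyclicMemory`, the tap arrangement standing in for the reading heads, and the 8-bit sample format are all hypothetical.

```python
from collections import deque


class CyclicMemory:
    """Circulating store of digital samples with equally spaced read taps."""

    def __init__(self, samples, heads):
        self.cells = deque(samples)
        self.spacing = len(samples) // heads   # cells between adjacent taps

    def shift(self, steps=1):
        """Advance the contents by `steps` cells, like the rotation of the drum."""
        self.cells.rotate(-steps)

    def read(self, head):
        """Sample currently under the given tap; later taps see the record later."""
        return self.cells[(-head * self.spacing) % len(self.cells)]


def to_analogue(code, bits=8, full_scale=1.0):
    """Idealized digital-to-analogue conversion of a signed integer sample."""
    return full_scale * code / (2 ** (bits - 1) - 1)


memory = CyclicMemory(list(range(20)), heads=10)
print(memory.read(0), memory.read(1))   # tap 0 is at the start of the record
memory.shift(2)
print(memory.read(0), memory.read(1))   # two shifts later the start reaches tap 1
print(to_analogue(127))
```

Selecting which tap begins the next reading then plays the same role as selecting a head on the magnetic drum.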

What we claim is:

1. A speech synthesizer of the pre-record and compilation type comprising: a memory on which a number of damped sinusoidal waves of different frequencies are recorded; means for selectively reading out at least one of said sinusoidal waves periodically according to a control signal, said period of reading out being variable; a memory on which a number of continuous signals having features of respective consonants are recorded; and means for selectively reading out at least one of said continuous signals at a specified time according to a control signal.

2. A speech synthesizer according to claim 1, which further comprises means for compiling the outputs of the first and second mentioned reading out means together.

KATHLEEN H. CLAFFY, Primary Examiner
C. W. JIRAUCH, Assistant Examiner

